NSFW Content
- Asia > China > Hong Kong (0.04)
- South America > Venezuela (0.04)
PLA: Prompt Learning Attack against Text-to-Image Generative Models
Lyu, Xinqi; Liu, Yihao; Li, Yanjie; Xiao, Bin
Text-to-Image (T2I) models have gained widespread adoption across various applications. Despite this success, the potential misuse of T2I models poses significant risks of generating Not-Safe-For-Work (NSFW) content. To investigate the vulnerability of T2I models, this paper delves into adversarial attacks that bypass safety mechanisms under black-box settings. Most previous methods rely on word substitution to search for adversarial prompts; because the search space is limited, this yields suboptimal performance compared to gradient-based training. However, black-box settings pose unique challenges for gradient-driven attack methods, since there is no access to the internal architecture and parameters of the T2I model. To facilitate the learning of adversarial prompts in black-box settings, we propose a novel prompt learning attack framework (PLA), in which gradient-based training tailored to black-box T2I models is enabled by exploiting multimodal similarities. Experiments show that our method can effectively attack the safety mechanisms of black-box T2I models, including prompt filters and post-hoc safety checkers, with a high success rate compared to state-of-the-art methods. Warning: This paper may contain offensive model-generated content.
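The abstract does not spell out PLA's training loop, but the core idea it names, learning a prompt by gradient descent against a multimodal similarity signal, can be sketched. Below is a minimal, assumption-laden illustration: a frozen `surrogate` module stands in for a white-box text encoder that supplies gradients (the real victim model is black-box), the target feature and embedding table are random placeholders, and the learned soft prompt is projected back to discrete tokens at the end. None of these names come from the paper.

```python
# Sketch: gradient-based prompt learning against a frozen surrogate encoder.
import torch

vocab_size, embed_dim, prompt_len = 1000, 512, 8

# Frozen stand-in for a white-box surrogate text encoder (hypothetical).
surrogate = torch.nn.Sequential(
    torch.nn.Flatten(),
    torch.nn.Linear(prompt_len * embed_dim, embed_dim),
)
for p in surrogate.parameters():
    p.requires_grad_(False)

token_table = torch.randn(vocab_size, embed_dim)   # stand-in embedding table
target_feat = torch.randn(embed_dim)               # stand-in feature of the blocked concept
target_feat = target_feat / target_feat.norm()

# Learnable soft prompt, later projected back to discrete tokens.
soft_prompt = torch.randn(prompt_len, embed_dim, requires_grad=True)
opt = torch.optim.Adam([soft_prompt], lr=1e-2)

for step in range(200):
    feat = surrogate(soft_prompt.unsqueeze(0)).squeeze(0)
    feat = feat / feat.norm()
    loss = 1.0 - feat @ target_feat                # maximize cosine similarity
    opt.zero_grad()
    loss.backward()
    opt.step()

# Project each learned soft token to its nearest vocabulary embedding.
hard_tokens = torch.cdist(soft_prompt.detach(), token_table).argmin(dim=1)
print(hard_tokens.tolist())
```

In a real attack, the similarity signal would come from a CLIP-style encoder scoring the prompt (or its generated image) against the blocked concept; the projection step is what turns the continuous optimization into a transferable discrete prompt.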
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Asia > China > Hong Kong (0.04)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)
VModA: An Effective Framework for Adaptive NSFW Image Moderation
Bao, Han; Wang, Qinying; Chen, Zhi; Li, Qingming; Zhang, Xuhong; Li, Changjiang; Wang, Zonghui; Ji, Shouling; Chen, Wenzhi
Not Safe/Suitable for Work (NSFW) content is rampant on social networks and causes serious harm to citizens, especially minors. Current detection methods mainly rely on deep learning-based image recognition and classification. However, NSFW images are now presented in increasingly sophisticated ways, often using image details and complex semantics to obscure their true nature or attract more views. Although still understandable to humans, these images often evade existing detection methods, posing a significant threat. Further complicating the issue, varying regulations across platforms and regions create additional challenges for effective moderation, leading to detection bias and reduced accuracy. To address this, we propose VModA, a general and effective framework that adapts to diverse moderation rules and handles complex, semantically rich NSFW content across categories. Experimental results show that VModA significantly outperforms existing methods, achieving up to a 54.3% accuracy improvement across NSFW types, including those with complex semantics. Further experiments demonstrate that our method exhibits strong adaptability across categories, scenarios, and base VLMs. We also identified inconsistent and controversial label samples in public NSFW benchmark datasets, re-annotated them, and submitted corrections to the original maintainers; the maintainers of two datasets have confirmed the updates so far. Additionally, we evaluate VModA in real-world scenarios to demonstrate its practical effectiveness.
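The abstract presents VModA as adapting to diverse moderation rules rather than a single fixed taxonomy. One plausible shape for such a system, not the authors' actual design, is to express each platform's policy in natural language and ask a VLM backend to judge an image against it. In this sketch, `vlm_query`, the `ModerationRule` fields, and the example policies are all hypothetical stand-ins.

```python
# Sketch: rule-adaptive NSFW moderation with a pluggable VLM backend.
from dataclasses import dataclass

@dataclass
class ModerationRule:
    platform: str
    policy: str   # natural-language description of what is disallowed

RULES = [
    ModerationRule("platform_A", "No explicit nudity; artistic nudity is allowed."),
    ModerationRule("platform_B", "No nudity of any kind; no graphic violence."),
]

def vlm_query(image_path: str, prompt: str) -> str:
    """Hypothetical VLM call; replace with a real multimodal model."""
    return "no"  # placeholder answer

def moderate(image_path: str, rule: ModerationRule) -> bool:
    """Return True if the image violates the given rule."""
    prompt = (
        f"Policy: {rule.policy}\n"
        "Considering both fine-grained image details and overall semantics, "
        "does this image violate the policy? Answer yes or no."
    )
    answer = vlm_query(image_path, prompt)
    return answer.strip().lower().startswith("yes")

for rule in RULES:
    print(rule.platform, moderate("example.jpg", rule))
```

Swapping the policy string, rather than retraining a classifier, is what makes this pattern adaptive across platforms and regions.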
- North America > United States > Illinois (0.04)
- Europe > Ukraine (0.04)
- Europe > Russia (0.04)
- (6 more...)
- Law (1.00)
- Government (1.00)
- Information Technology > Services (0.88)
- (2 more...)
TokenProber: Jailbreaking Text-to-image Models via Fine-grained Word Impact Analysis
Wang, Longtian; Xie, Xiaofei; Li, Tianlin; Zhi, Yuhan; Shen, Chao
Text-to-image (T2I) models have significantly advanced in producing high-quality images. However, such models can generate images containing not-safe-for-work (NSFW) content, such as pornography, violence, political content, and discrimination. To mitigate the risk of generating NSFW content, refusal mechanisms, i.e., safety checkers, have been developed to check for potential NSFW content. Adversarial prompting techniques have been developed to evaluate the robustness of these refusal mechanisms. The key challenge remains to subtly modify the prompt in a way that preserves its sensitive nature while bypassing the refusal mechanisms. In this paper, we introduce TokenProber, a method designed for sensitivity-aware differential testing, aimed at evaluating the robustness of the refusal mechanisms in T2I models by generating adversarial prompts. Our approach is based on the key observation that adversarial prompts often succeed by exploiting discrepancies in how T2I models and safety checkers interpret sensitive content. Thus, we conduct a fine-grained analysis of the impact of specific words within prompts, distinguishing between dirty words that are essential for NSFW content generation and discrepant words that highlight the different sensitivity assessments between T2I models and safety checkers. Through sensitivity-aware mutation, TokenProber generates adversarial prompts, striking a balance between maintaining NSFW content generation and evading detection. Our evaluation of TokenProber against 5 safety checkers on 3 popular T2I models, using 324 NSFW prompts, demonstrates its superior effectiveness in bypassing safety filters compared to existing methods (e.g., a 54%+ increase on average), highlighting TokenProber's ability to uncover robustness issues in existing refusal mechanisms. The source code, datasets, and experimental results are available in [1]. Warning: This paper contains model outputs that are offensive in nature. Text-to-Image (T2I) models have gained widespread attention due to their excellent capability in synthesizing high-quality images. T2I models, such as Stable Diffusion [2] and DALL-E [3], process the textual descriptions provided by users, namely prompts, and output images that match the descriptions. Such models have been widely used to generate various types of images; for example, Lexica [4] contains more than five million images generated by Stable Diffusion.
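The dirty/discrepant distinction suggests a simple differential scoring scheme: ablate each word and compare how much it moves the safety checker versus the generator. The sketch below illustrates that idea with hypothetical `checker_score` and `gen_score` oracles and arbitrary thresholds; TokenProber's actual impact analysis and mutation strategy are more involved than this leave-one-word-out pass.

```python
# Sketch: leave-one-word-out impact scoring for differential testing.
def checker_score(prompt: str) -> float:
    """Hypothetical oracle: safety checker's NSFW probability for `prompt`."""
    return 0.5  # placeholder

def gen_score(prompt: str) -> float:
    """Hypothetical oracle: NSFW-ness of the image the T2I model generates."""
    return 0.5  # placeholder

def word_impact(prompt: str):
    """Score each word by how much its removal shifts checker vs. generator."""
    words = prompt.split()
    base_c, base_g = checker_score(prompt), gen_score(prompt)
    report = []
    for i, w in enumerate(words):
        ablated = " ".join(words[:i] + words[i + 1:])
        dc = base_c - checker_score(ablated)   # how much w trips the checker
        dg = base_g - gen_score(ablated)       # how much w drives generation
        # Illustrative heuristic: "dirty" words matter to generation;
        # "discrepant" words trip the checker without driving generation.
        kind = "dirty" if dg > 0.1 else ("discrepant" if dc > 0.1 else "neutral")
        report.append((w, round(dc, 3), round(dg, 3), kind))
    return report

print(word_impact("a sample prompt with several words"))
```

Mutation would then preferentially rewrite discrepant words (to evade the checker) while keeping dirty words intact (to preserve the sensitive generation).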
Towards Safe Synthetic Image Generation On the Web: A Multimodal Robust NSFW Defense and Million Scale Dataset
Muneer, Muhammad Shahid; Woo, Simon S.
In the past years, we have witnessed the remarkable success of Text-to-Image (T2I) models and their widespread use on the web. Extensive research in making T2I models produce hyper-realistic images has led to new concerns, such as generating Not-Safe-For-Work (NSFW) web content and polluting web society. To help prevent misuse of T2I models and create a safer web environment for users, features like NSFW filters and post-hoc security checks are used in these models. However, recent work has unveiled how these methods can easily fail to prevent misuse. In particular, adversarial attacks on the text and image modalities can easily defeat defensive measures. Moreover, there is currently no robust multimodal NSFW dataset that includes both prompt and image pairs and adversarial examples. To address this, we first propose a million-scale prompt and image dataset generated using open-source diffusion models. Second, we develop a multimodal defense to distinguish safe and NSFW text and images, which is robust against adversarial attacks and directly alleviates current challenges. Our extensive experiments show that our model performs well compared to existing SOTA NSFW detection methods in terms of accuracy and recall, drastically reducing the Attack Success Rate (ASR) in multimodal adversarial attack scenarios. Code: https://github.com/shahidmuneer/multimodal-nsfw-defense.
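The abstract specifies a defense that operates on both prompts and images but not its architecture. A minimal late-fusion classifier over frozen text and image embeddings, shown below, is one plausible shape for such a detector; the dimensions, layer sizes, and the choice of fusion are assumptions, not the paper's actual design.

```python
# Sketch: late-fusion multimodal NSFW classifier over frozen embeddings.
import torch
import torch.nn as nn

class MultimodalNSFWClassifier(nn.Module):
    def __init__(self, text_dim: int = 512, image_dim: int = 512, hidden: int = 256):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(text_dim + image_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),          # logits: [safe, nsfw]
        )

    def forward(self, text_emb: torch.Tensor, image_emb: torch.Tensor) -> torch.Tensor:
        # Concatenate the two modalities and classify the pair jointly.
        return self.fuse(torch.cat([text_emb, image_emb], dim=-1))

# Dummy batch: in practice the embeddings would come from frozen encoders
# (e.g., the text and image towers of a CLIP-style model).
model = MultimodalNSFWClassifier()
logits = model(torch.randn(4, 512), torch.randn(4, 512))
print(logits.softmax(dim=-1))
```

Judging the prompt and image jointly is what lets a detector of this shape catch cases where either modality alone looks benign, which is exactly the gap the adversarial attacks above exploit.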
- Oceania > Australia > New South Wales > Sydney (0.06)
- North America > United States > New York > New York County > New York City (0.05)
- Asia > South Korea > Gyeonggi-do > Suwon (0.05)
- Information Technology > Security & Privacy (1.00)
- Government > Military (1.00)
Hyperbolic Safety-Aware Vision-Language Models
Poppi, Tobia; Kasarla, Tejaswi; Mettes, Pascal; Baraldi, Lorenzo; Cucchiara, Rita
Addressing the retrieval of unsafe content from vision-language models such as CLIP is an important step towards real-world integration. Current efforts have relied on unlearning techniques that try to erase the model's knowledge of unsafe concepts. While effective in reducing unwanted outputs, unlearning limits the model's capacity to discern between safe and unsafe content. In this work, we introduce a novel approach that shifts from unlearning to an awareness paradigm by leveraging the inherent hierarchical properties of the hyperbolic space. We propose to encode safe and unsafe content as an entailment hierarchy, where both are placed in different regions of hyperbolic space. Our HySAC, Hyperbolic Safety-Aware CLIP, employs entailment loss functions to model the hierarchical and asymmetrical relations between safe and unsafe image-text pairs. This modelling, ineffective in standard vision-language models due to their reliance on Euclidean embeddings, endows the model with awareness of unsafe content, enabling it to serve as both a multimodal unsafe classifier and a flexible content retriever, with the option to dynamically redirect unsafe queries toward safer alternatives or retain the original output. Extensive experiments show that our approach not only enhances safety recognition but also establishes a more adaptable and interpretable framework for content moderation in vision-language models. Our source code is available at https://github.com/aimagelab/HySAC.
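The geometric ingredients the abstract names, hyperbolic embeddings and a hierarchy separating safe from unsafe content, can be illustrated schematically. The code below defines the standard Poincaré-ball geodesic distance and an illustrative norm-ordering penalty; the convention that more general (here, safe) content sits nearer the origin is an assumption borrowed from common hyperbolic entailment models, and this is not HySAC's actual entailment-loss formulation.

```python
# Sketch: Poincare-ball distance and a toy safe/unsafe hierarchy penalty.
import torch

def poincare_dist(u: torch.Tensor, v: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Geodesic distance in the Poincare ball:
    d(u, v) = arcosh(1 + 2|u-v|^2 / ((1-|u|^2)(1-|v|^2)))."""
    sq = lambda x: x.pow(2).sum(dim=-1)
    num = 2 * sq(u - v)
    den = (1 - sq(u)).clamp_min(eps) * (1 - sq(v)).clamp_min(eps)
    return torch.acosh(1 + num / den + eps)

def hierarchy_loss(safe: torch.Tensor, unsafe: torch.Tensor, margin: float = 0.1):
    """Penalize unsafe embeddings whose norm is not at least `margin` larger
    than their safe counterparts', pushing unsafe content outward."""
    radial = (safe.norm(dim=-1) - unsafe.norm(dim=-1) + margin).clamp_min(0)
    return radial.mean()

safe = torch.randn(4, 32) * 0.1     # toy embeddings inside the unit ball
unsafe = torch.randn(4, 32) * 0.1
print(poincare_dist(safe, unsafe), hierarchy_loss(safe, unsafe))
```

HySAC's actual objective additionally models the asymmetric entailment between safe and unsafe image-text pairs; the point here is only the geometry that makes such asymmetry expressible, which Euclidean embeddings lack.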
- Europe > Netherlands > North Holland > Amsterdam (0.04)
- Europe > Italy > Tuscany > Pisa Province > Pisa (0.04)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Buster: Implanting Semantic Backdoor into Text Encoder to Mitigate NSFW Content Generation
Zhao, Xin; Chen, Xiaojun; Xuan, Yuexin; Zhao, Zhendong; Jia, Xiaojun; Li, Xinfeng; Wang, Xiaofeng
The rise of deep learning models in the digital era has raised substantial concerns regarding the generation of Not-Safe-for-Work (NSFW) content. Existing defense methods primarily involve model fine-tuning and post-hoc content moderation. Nevertheless, these approaches largely lack scalability in eliminating harmful content, degrade the quality of benign image generation, or incur high inference costs. To address these challenges, we propose an innovative framework named Buster, which injects backdoors into the text encoder to prevent NSFW content generation. Buster leverages deep semantic information rather than explicit prompts as triggers, redirecting NSFW prompts towards targeted benign prompts. Additionally, Buster employs energy-based training data generation through Langevin dynamics for adversarial knowledge augmentation, thereby ensuring robustness in harmful concept definition. This approach demonstrates exceptional resilience and scalability in mitigating NSFW content. Notably, Buster fine-tunes the text encoder of Text-to-Image models within merely five minutes, showcasing its efficiency. Our extensive experiments demonstrate that Buster outperforms nine state-of-the-art baselines, achieving a superior NSFW content removal rate of at least 91.2% while preserving the quality of harmless images.
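Buster's core move, redirecting harmful prompts to a benign target in embedding space while leaving benign prompts untouched, can be written as a two-term fine-tuning objective. The sketch below uses a toy encoder and random stand-ins for prompt batches, and omits the semantic triggers and Langevin-dynamics augmentation the abstract describes; every name in it is illustrative.

```python
# Sketch: two-term backdoor fine-tuning of a text encoder.
import copy
import torch
import torch.nn as nn

dim = 64
encoder = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, dim))
reference = copy.deepcopy(encoder).eval()          # frozen copy of the original
for p in reference.parameters():
    p.requires_grad_(False)

benign_x = torch.randn(16, dim)                    # stand-in benign prompt batch
harmful_x = torch.randn(16, dim)                   # stand-in NSFW prompt batch
target = reference(torch.randn(1, dim)).detach()   # embedding of a benign target prompt

opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)
for step in range(100):
    # Redirect: harmful prompts should embed at the benign target.
    redirect = (encoder(harmful_x) - target).pow(2).mean()
    # Preserve: benign prompts should keep their original embeddings.
    preserve = (encoder(benign_x) - reference(benign_x)).pow(2).mean()
    loss = redirect + preserve
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"redirect={redirect.item():.4f} preserve={preserve.item():.4f}")
```

Because only the text encoder changes, the downstream diffusion model needs no modification, which is consistent with the five-minute fine-tuning cost the abstract reports.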
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > Maryland > Baltimore (0.04)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- (5 more...)
- Information Technology > Security & Privacy (1.00)
- Law (0.68)
AdvI2I: Adversarial Image Attack on Image-to-Image Diffusion Models
Zeng, Yaopei; Cao, Yuanpu; Cao, Bochuan; Chang, Yurui; Chen, Jinghui; Lin, Lu
Recent advances in diffusion models have significantly enhanced the quality of image synthesis, yet they have also introduced serious safety concerns, particularly the generation of Not Safe for Work (NSFW) content. Previous research has demonstrated that adversarial prompts can be used to generate NSFW content. However, such adversarial text prompts are often easily detectable by text-based filters, limiting their efficacy. In this paper, we expose a previously overlooked vulnerability: adversarial image attacks targeting Image-to-Image (I2I) diffusion models. We propose AdvI2I, a novel framework that manipulates input images to induce diffusion models to generate NSFW content. By optimizing a generator to craft adversarial images, AdvI2I circumvents existing defense mechanisms, such as Safe Latent Diffusion (SLD), without altering the text prompts. Furthermore, we introduce AdvI2I-Adaptive, an enhanced version that adapts to potential countermeasures and minimizes the resemblance between adversarial images and NSFW concept embeddings, making the attack more resilient against defenses. Through extensive experiments, we demonstrate that both AdvI2I and AdvI2I-Adaptive can effectively bypass current safeguards, highlighting the urgent need for stronger security measures to address the misuse of I2I diffusion models.
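The abstract describes training a generator that crafts bounded image perturbations steering the diffusion model toward an NSFW concept. Below is a schematic of that objective against a frozen surrogate feature extractor; the `generator` architecture, `surrogate`, concept vector, and perturbation budget are all illustrative stand-ins, not the authors' pipeline.

```python
# Sketch: training a bounded adversarial-image generator against a
# frozen surrogate feature extractor.
import torch
import torch.nn as nn

generator = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                          nn.Conv2d(16, 3, 3, padding=1), nn.Tanh())
surrogate = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))  # frozen stand-in
for p in surrogate.parameters():
    p.requires_grad_(False)

concept = torch.randn(128)                   # stand-in NSFW concept embedding
concept = concept / concept.norm()
eps = 8 / 255                                # L_inf perturbation budget

opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
for step in range(100):
    x = torch.rand(8, 3, 32, 32)             # batch of benign images in [0, 1]
    x_adv = (x + eps * generator(x)).clamp(0, 1)   # Tanh output keeps |delta| <= eps
    feat = surrogate(x_adv)
    feat = feat / feat.norm(dim=-1, keepdim=True)
    loss = (1 - feat @ concept).mean()       # pull features toward the concept
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Because the perturbation lives in the image rather than the prompt, text-based filters never see anything suspicious, which is the vulnerability the paper highlights; the adaptive variant would add a term penalizing similarity between the perturbation and known NSFW concept embeddings.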
- North America > United States > Pennsylvania > Centre County > University Park (0.04)
- Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
Exploring the Use of Abusive Generative AI Models on Civitai
Wei, Yiluo; Zhu, Yiming; Hui, Pan; Tyson, Gareth
The rise of generative AI is transforming the landscape of digital imagery, and exerting a significant influence on online creative communities. This has led to the emergence of AI-Generated Content (AIGC) social platforms, such as Civitai. These distinctive social platforms allow users to build and share their own generative AI models, thereby enhancing the potential for more diverse artistic expression. Designed in the vein of social networks, they also provide artists with the means to showcase their creations (generated from the models), engage in discussions, and obtain feedback, thus nurturing a sense of community. Yet, this openness also raises concerns about the abuse of such platforms, e.g., using models to disseminate deceptive deepfakes or infringe upon copyrights. To explore this, we conduct the first comprehensive empirical study of an AIGC social platform, focusing on its use for generating abusive content. As an exemplar, we construct a comprehensive dataset covering Civitai, the largest available AIGC social platform. Based on this dataset of 87K models and 2M images, we explore the characteristics of content and discuss strategies for moderation to better govern these platforms.
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
- North America > United States > New York > New York County > New York City (0.05)
- Asia > China > Hong Kong (0.05)
- (10 more...)
- Media > News (1.00)
- Information Technology > Security & Privacy (0.97)
- Information Technology > Communications > Social Media (1.00)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.92)